ABSTRACT:

The genome-wide recombination rate (RR) of a species is often described by one parameter, the ratio
between total genetic map length (G) and physical map length (P), measured in centiMorgans per Megabase
(cM/Mb). The value of this parameter varies greatly between species, but the cause for these differences is
not entirely clear. A constraining factor of overall RR in a species, which may cause increased RR for smaller
chromosomes, is the requirement of at least one chiasma per chromosome (or chromosome-arm) per meiosis.
In the present study, we quantify the relative excess of recombination events on smaller chromosomes by a
linear regression model, which relates the genetic length of chromosomes to their physical length. We find for
several species that the two-parameter regression, G = G_0 + k · P, provides a better characterization of the
relationship between genetic and physical map length than the one-parameter regression that runs through the
origin. A non-zero intercept (G_0) indicates a relative excess of recombination on smaller chromosomes in a
genome. Given G_0, the parameter k predicts the increase of genetic map length over the increase of physical
map length. The observed values of G_0 have a similar magnitude for diverse species, whereas k varies by two
orders of magnitude. The implications of this strategy for the genetic maps of human, mouse, rat, chicken,
honeybee, worm and yeast are discussed.

Introduction

The rate of meiotic recombination (RR), defined as the ratio between genetic and
physical map length and measured in centiMorgan per Megabase (cM/Mb), is known to vary
widely between the genomes of different species. As a rule of thumb for the human genome,
1cM genetic map length equals 1Mb physical map length, see e.g. (Collins and Morton 1998;
Ulgen and Li 2005). This rate is about twice as large as the genome-wide RR observed in the
mouse genome (Jensen-Seaman et al. 2004), but far less than the RR of 340cM/Mb that is
observed in the yeast genome (Mortimer et al. 1992; Baudat and Nicolas 1997). Understanding
of these differences in RR between different species is of fundamental importance for
evolutionary and medical genetics (Nachman 2002). In addition to these differences between
species, it was also noted that RR differs between chromosomes within a species, with smaller
chromosomes showing higher RR (Nachman and Churchill 1996; Broman et al. 1998; Lander et al.
2001; Venter et al. 2001; Kong et al. 2002; Matise et al. 2007). Therefore, species differences
in genome-wide RR may be best studied under a model that also considers the intragenomic
differences between chromosomes.

From a population genetic perspective, the main role of recombination is the production [...]
efficiency of natural selection in theoretical and empirical model systems (Maynard-Smith 1978;
Barton and Charlesworth 1998; Rice 2002; Otto and Lenormand 2002). Many recent
empirical studies have addressed the question at which sites in a genome recombination is
most likely to occur (Petes 2001; McVean et al. 2004; Hey 2004; Myers et al. 2005; Coop 2005;
Mancera et al. 2008). In this context it was also found that RR evolves extremely
fast on a kb-scale (Ptak et al. 2005; Winckler et al. 2005) and that historical recombination
hotspots are associated with specific gene functions in human, which was hypothesized
to indicate an influence of natural selection on hotspot locations (Freudenberg et al. 2007;
The International HapMap Consortium 2007). When RR is examined at a megabase scale instead
of a kilobase scale, the evolution of local RR is more constrained (Myers et al. 2005) and
differs much less between closely related species, such as human and chimpanzees (Winckler
et al. 2005; Ptak et al. 2005). However, the mechanism behind this conservation of RR on the larger
scale is unclear. One contributing explanation could be the requirement of a minimal or fixed
number of chiasmata per chromosome during meiosis to stabilize homologous chromosome
pairs (Mather 1938).

The question of exactly how many chiasmata are required per chromosome or per chromosome
arm has not been resolved yet and might not have a generally valid answer (Lynn et al. 2004;
Laurie and Hulten 1985). Nevertheless, these meiotic constraints can explain the excess of
recombination on shorter chromosomes. Consistent with an influence of karyotype on overall
recombination rate, a correlation was found between the number of chromosome arms in
a genome and the genetic map length (De Villena and Sapienza 2001a). Altered recombination
may lead to aneuploidy (Hassold and Hunt 2001; Lynn et al. 2004), which may impose
strong selective constraints and explain the tight relationship between karyotype structure and
recombination rate (Dumas and Britton-Davidian 2002; De Villena and Sapienza 2001b).



On the other hand, domesticated plants and animals show evidence for increased chiasma
formation (Burt and Bell 1987), which suggests that there exist determinants of
genome-wide RR in addition to karyotype. For instance, the level of interference in chiasma
formation could differ between species (Broman et al. 2002). Therefore, it would be useful to apply a
formal method that separates the contribution of karyotype structure from the relationship
between physical and genetic map length. This is not accomplished by the genome-wide
cM/Mb ratio: although the cM/Mb ratio is a convenient single-parameter measurement, it
does not model the higher contribution of smaller chromosomes to the genome-wide RR of a
species.

To address this problem and better understand the overall RR of a genome, we propose
a novel strategy that explicitly models if, and to what extent, the overall RR in a genome
is influenced by the relative excess of recombination on smaller chromosomes. This proposed
two-parameter strategy takes into account that a certain minimal amount of recombination
is required to maintain genome integrity during meiosis (Mather 1938) and that a genome
therefore has a minimal genetic map length. This idea becomes clearer if we use a statistical
regression framework to compare the proposed strategy with the one-parameter strategy that
is typically applied to shorter scales than the chromosome scale. Since the one-parameter
characterization of RR implies that genetic length is proportional to the physical length and
recombination events occur independently on different chromosomes, the cM/Mb ratio is the
slope of the linear regression of genetic lengths of chromosomes over their sequence lengths,
with the requirement that the regression line goes through the origin. In our new approach, we
drop the requirement that the regression line must go through the origin by using two parameters
to fit the genome-wide genetic map information at the chromosomal scale.

From a biological perspective, the one-parameter model considers the length of the genetic
map of a genome to be determined by the length of the underlying physical map and the
species-specific RR. Building on this, the two-parameter model also includes a separate effect
of karyotype structure that may produce a disproportional distribution of recombination events
over chromosomes of different length. Under the two-parameter model, the value of the
y-intercept quantifies the relative excess of recombination events on a hypothetical chromosome
with length zero, whereas the slope of the regression measures the increase of genetic with
physical map length in the same way as the one-parameter model. Our results show that in
human, as well as other species, the two-parameter regression provides a much better fit for
describing the genetic map length of chromosomes.
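
The contrast between the two models can be sketched in a few lines of Python. This is a minimal
illustration under stated assumptions, not the analysis code behind the results: the five data points are
hypothetical (constructed to lie exactly on G = 50 + 0.8 P), and the function names
`fit_through_origin` and `fit_with_intercept` are illustrative labels.

```python
# Hypothetical illustration data (NOT taken from any genetic map):
# physical length in Mb and genetic length in cM, with G = 50 + 0.8 * P.
physical_mb = [50.0, 100.0, 150.0, 200.0, 250.0]
genetic_cm = [90.0, 130.0, 170.0, 210.0, 250.0]


def fit_through_origin(p, g):
    """One-parameter model G = k * P: least-squares slope, line forced through the origin."""
    return sum(pi * gi for pi, gi in zip(p, g)) / sum(pi * pi for pi in p)


def fit_with_intercept(p, g):
    """Two-parameter model G = G_0 + k * P: ordinary least squares, returns (G_0, k)."""
    n = len(p)
    mean_p = sum(p) / n
    mean_g = sum(g) / n
    sxy = sum((pi - mean_p) * (gi - mean_g) for pi, gi in zip(p, g))
    sxx = sum((pi - mean_p) ** 2 for pi in p)
    k = sxy / sxx
    return mean_g - k * mean_p, k


k_one = fit_through_origin(physical_mb, genetic_cm)
g0, k_two = fit_with_intercept(physical_mb, genetic_cm)
print(f"one-parameter: k = {k_one:.3f} cM/Mb")
print(f"two-parameter: G_0 = {g0:.2f} cM, k = {k_two:.3f} cM/Mb")
```

On these constructed data the slope forced through the origin (about 1.07 cM/Mb) exceeds the
incremental rate k = 0.8 cM/Mb, because the one-parameter slope has to absorb the constant term G_0.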



Results

A two-parameter regression model fits the genetic map length of human chromosomes
better than the one-parameter model

To look for systematic differences in recombination rate between human chromosomes, we
started by reproducing the Marey map (Chakravarti 1991; Rezvoy et al. 2007), a cumulative
plot similar to those used in DNA sequence representation or analysis (Li 1997; Grigoriev
1998), for 22 human autosomes and 34 arms of metacentric chromosomes (Appendix Figure
A1). The chromosome-scale or chromosome-arm scale recombination rate may be defined as
the slope of a straight line that links the first and the last marker. For smaller chromosomes
or chromosome arms, the end points in the Marey map tend to lie above the line with a slope
equal to 1 (cM/Mb), i.e., smaller chromosomes (-arms) have larger cM/Mb ratios (see also
Figure 16 of (Lander et al. 2001) and Table 12 of (Venter et al. 2001)).

Figure 1: Two-parameter regression of human genetic length over physical length. (A) Analysis at the
chromosome scale. Female (red), male (blue), and sex-averaged (solid circle) genetic length of each chromosome
(in cM) is plotted against its physical length (in Mb). The least-square regression lines are: y = 54.2 + 1.02x
(female), y = 42.0 + 0.52x (male), y = 48.1 + 0.78x (sex-average). (B) Analysis of metacentric chromosomes
at the chromosome-arm scale. The best fit regression lines are: y = 29.0 + 1.05x (female), y = 27.1 + 0.48x
(male), y = 28.0 + 0.77x (sex-average).

We next regressed the genetic map length of chromosomes over their physical map length
(Figure 1(A); a similar plot can be found in (Housworth and Stahl 2003)). When sex-averaged,
female and male genetic lengths are fitted separately, the three regression lines are described
by:

    G_{ch,sex-ave,human} = 48.1 + 0.78 P
    G_{ch,female,human}  = 54.2 + 1.02 P
    G_{ch,male,human}    = 42.0 + 0.53 P        (1)

The normality assumption of regression residuals was tested graphically by a QQ-plot (Appendix
Figure A2), and the normality condition does not seem to be violated.

Equation (1) shows that the y-intercept G_0 for female data is 29% larger than for male
data, whereas the slope k is 92% larger. Thus, the different length of the male and female
map mainly manifests as a different slope and less so as a different y-intercept. As can be seen
from Figure 1(A), all human chromosomes exceed the minimal length of 50cM both for the
male and the female genetic map.

To test the robustness of the y-intercept value, we added random noise to the genetic map
length and repeated the regression analysis. The histogram of 50000 y-intercepts from this
procedure is shown in Appendix Figure A3. Although values of G_0 range from 35 to 60, they
are all far from zero.
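
A robustness check of this kind can be sketched as follows. The sketch rests on assumptions that the
text does not specify: the noise is taken to be Gaussian with a standard deviation of 5 cM, the data are
hypothetical points on G = 50 + 0.8 P, and the helper name `noisy_intercepts` is illustrative.

```python
import random


def fit_with_intercept(p, g):
    """Ordinary least squares for G = G_0 + k * P, returns (G_0, k)."""
    n = len(p)
    mean_p = sum(p) / n
    mean_g = sum(g) / n
    sxy = sum((pi - mean_p) * (gi - mean_g) for pi, gi in zip(p, g))
    sxx = sum((pi - mean_p) ** 2 for pi in p)
    k = sxy / sxx
    return mean_g - k * mean_p, k


def noisy_intercepts(p, g, noise_sd, n_rep, seed=0):
    """Refit the regression n_rep times after adding Gaussian noise to the genetic lengths."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    intercepts = []
    for _ in range(n_rep):
        g_noisy = [gi + rng.gauss(0.0, noise_sd) for gi in g]
        g0, _k = fit_with_intercept(p, g_noisy)
        intercepts.append(g0)
    return intercepts


# Hypothetical data (assumption): G = 50 + 0.8 * P, lengths in Mb and cM.
physical_mb = [50.0, 100.0, 150.0, 200.0, 250.0]
genetic_cm = [90.0, 130.0, 170.0, 210.0, 250.0]

intercepts = noisy_intercepts(physical_mb, genetic_cm, noise_sd=5.0, n_rep=2000)
mean_g0 = sum(intercepts) / len(intercepts)
print(f"G_0 range: {min(intercepts):.1f} to {max(intercepts):.1f}, mean {mean_g0:.1f}")
```

The spread of the refitted intercepts, compared with their distance from zero, is the quantity of interest;
a histogram of `intercepts` would play the role of Appendix Figure A3.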

We next repeated the analysis using chromosome arms instead of full chromosomes as separate
data points (Figure 1(B)). This leads to the regression equations:

    G_{arm,sex-ave,human} = 28.0 + 0.77 P
    G_{arm,female,human}  = 29.0 + 1.05 P
    G_{arm,male,human}    = 27.1 + 0.48 P        (2)

The y-intercept of the chromosome-arm scale regression is now reduced to somewhat more
than half of the intercept at the full chromosome scale. This reduction shows that cytogenetic
constraints exert a smaller influence on the chromosome-arm scale than on the full chromosome
scale.

Several methods can be used to show that the two-parameter regression model fits the data
better than the one-parameter regressions. To this end, we first compared the coefficient of
determination R^2, which is the proportion of variability explained by the regression model. The
observed R^2 values of the one-parameter regression range between 0.48 and 0.87, whereas the
R^2 values of the two-parameter regressions range between 0.86 and 0.98 (Table 1), indicating
that the two-parameter regression explains more of the variability in the data.

We further cast the comparison between the one- and two-parameter regression as a model
selection problem. Two such model comparison strategies are provided by the Akaike information
criterion (AIC) (Akaike 1974) and Bayesian information criterion (BIC) (Schwarz 1978).
Both AIC and BIC values for the two-parameter regression model are smaller than those for
the one-parameter model, indicating a better statistical model (Appendix Table A1).
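
The model-selection comparison can be illustrated with a short Python sketch. It assumes Gaussian
residuals and uses the common least-squares forms AIC = n ln(RSS/n) + 2p and BIC = n ln(RSS/n) + p ln(n),
with additive constants dropped; the exact formulas behind Appendix Table A1 are not stated in the text, and
the data points below are hypothetical.

```python
import math


def fit_through_origin(p, g):
    """One-parameter model G = k * P, returned as (intercept, slope) with intercept fixed at 0."""
    return 0.0, sum(pi * gi for pi, gi in zip(p, g)) / sum(pi * pi for pi in p)


def fit_with_intercept(p, g):
    """Two-parameter model G = G_0 + k * P, returns (G_0, k)."""
    n = len(p)
    mean_p = sum(p) / n
    mean_g = sum(g) / n
    sxy = sum((pi - mean_p) * (gi - mean_g) for pi, gi in zip(p, g))
    sxx = sum((pi - mean_p) ** 2 for pi in p)
    k = sxy / sxx
    return mean_g - k * mean_p, k


def residual_sum_of_squares(p, g, g0, k):
    return sum((gi - (g0 + k * pi)) ** 2 for pi, gi in zip(p, g))


def aic_bic(rss, n, n_params):
    """Least-squares AIC and BIC up to an additive constant (assumed Gaussian errors)."""
    base = n * math.log(rss / n)
    return base + 2 * n_params, base + n_params * math.log(n)


# Hypothetical data (assumption): roughly G = 50 + 0.8 * P with small deviations.
physical_mb = [50.0, 100.0, 150.0, 200.0, 250.0]
genetic_cm = [92.0, 128.0, 171.0, 209.0, 250.0]
n = len(physical_mb)

rss_one = residual_sum_of_squares(physical_mb, genetic_cm, *fit_through_origin(physical_mb, genetic_cm))
rss_two = residual_sum_of_squares(physical_mb, genetic_cm, *fit_with_intercept(physical_mb, genetic_cm))
aic_one, bic_one = aic_bic(rss_one, n, 1)
aic_two, bic_two = aic_bic(rss_two, n, 2)
print(f"one-parameter: AIC = {aic_one:.2f}, BIC = {bic_one:.2f}")
print(f"two-parameter: AIC = {aic_two:.2f}, BIC = {bic_two:.2f}")
```

The extra parameter is penalized in both criteria, so the two-parameter model only wins when the
reduction in residual variance outweighs the penalty, as it does for these constructed points.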

Finally, we tested the null hypothesis that G_0 is zero. The p-values in this test range
between 10^-[...] and 10^-[...] (Table 1). Because the null hypothesis is that all chromosomes have
the same RR, a simulated distribution of G_0 can be obtained from the regression over data
that are obtained by permuting the chromosome-specific cM/Mb ratios, while leaving the
physical chromosome length unchanged. Out of 50000 such permutations, only two showed
a G_0 value that is larger than the observed value of 48.1 (for sex-averaged full chromosome
data), corresponding to a p-value of 4 × 10^-5. To summarize, all evaluation methods support the
conclusion that the two-parameter regression model is better than the one-parameter model.
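
The permutation procedure can be sketched as below. The eight chromosomes are hypothetical, the function
name `permutation_p_value` is illustrative, and the number of permutations is kept small; with so few
constructed data points the resulting p-value is illustrative only and says nothing about real genomes.

```python
import random


def fit_intercept(p, g):
    """Least-squares intercept G_0 of the regression G = G_0 + k * P."""
    n = len(p)
    mean_p = sum(p) / n
    mean_g = sum(g) / n
    sxy = sum((pi - mean_p) * (gi - mean_g) for pi, gi in zip(p, g))
    sxx = sum((pi - mean_p) ** 2 for pi in p)
    return mean_g - (sxy / sxx) * mean_p


def permutation_p_value(p, g, n_perm, seed=0):
    """Fraction of permutations whose intercept is larger than the observed one.

    The chromosome-specific cM/Mb ratios are shuffled across chromosomes while the
    physical lengths stay fixed, as described in the text.
    """
    rng = random.Random(seed)
    observed = fit_intercept(p, g)
    ratios = [gi / pi for pi, gi in zip(p, g)]
    exceed = 0
    for _ in range(n_perm):
        shuffled = ratios[:]
        rng.shuffle(shuffled)
        g_perm = [ri * pi for ri, pi in zip(shuffled, p)]
        if fit_intercept(p, g_perm) > observed:
            exceed += 1
    return exceed / n_perm


# Hypothetical chromosomes (assumption): lengths in Mb, genetic length G = 50 + 0.8 * P in cM.
physical_mb = [20.0, 40.0, 60.0, 80.0, 100.0, 120.0, 140.0, 160.0]
genetic_cm = [50.0 + 0.8 * pi for pi in physical_mb]

p_value = permutation_p_value(physical_mb, genetic_cm, n_perm=5000)
print(f"permutation p-value for G_0: {p_value:.4f}")
```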







                                         two-parameter    one-parameter    p-value for
                                         R^2              R^2              testing G_0 = 0
human chromosome,        sex-averaged    0.976            0.817            3.1 × 10^-[...]
                         female          0.984            0.866            8.6 × 10^-[...]
                         male            0.942            0.694            1.2 × 10^-[...]
human chromosome arm,    sex-averaged    0.955            0.775            9.3 × 10^-13
                         female          0.964            0.861            7.1 × 10^-[...]
                         male            0.864            0.480            7.7 × 10^-[...]

Table 1: Comparison of the two-parameter and one-parameter regression models for human genetic length,
at the chromosome scale (22 data points) and the chromosome-arm scale (34 data points): coefficient of
determination (R^2) for 2- and 1-parameter regressions, and p-value for testing the null hypothesis of zero
y-intercept.



As the deCode data were published more than six years ago, we further tested the chromosome-scale
regression strategy on a more recent dataset, the Rutgers Map v.2 (Matise et al. 2007).
The regression lines are G = 53.33 + 0.87P (sex-average), G = 50.29 + 1.19P (female), and
G = 57.58 + 0.57P (male), respectively. These results are consistent with the parameter
estimations in Eq.(1), again showing that male and female data differ more in the slope than in
the y-intercept.

Two other quantities can be derived from G_0 that help to interpret the y-intercept parameter.
The first is the physical length P_min on the regression line that corresponds to a specified
minimum genetic length G_min such that G_min = G_0 + k P_min. If we set G_min = 50cM, then
we obtain P_min = 2.45Mb for sex-averaged chromosome data. One may assume that for any
hypothetical chromosome with P < P_min, its genetic length G remains constant at 50cM and
does not decrease for shorter chromosome length. As the second quantity of interest, we define
the percentage of genetic length that is explained by the inclusion of G_0 into the model as:

    \alpha = 22 G_0 / (22 G_0 + k \sum_{i=1}^{22} P_i).

For the sex-averaged data, we find that \alpha = 31% of the genetic length
is explained by the y-intercept. This value can also be obtained from the decomposition of
the genome-wide ratio \sum_{i=1}^{22} G_i / \sum_{i=1}^{22} P_i = 22 G_0 / \sum_{i=1}^{22} P_i + k,
that is 1.13 = 0.35 + 0.78, because 0.35/1.13 = 31%.
The relatively large percentage value once again highlights the importance of the y-intercept
G_0 for modeling chromosome-scale recombination rate in human.
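
The two derived quantities follow from simple arithmetic on the reported sex-averaged values, which the
following sketch reproduces. It uses only the rounded numbers quoted in the text (G_0 = 48.1, k = 0.78, and
the intercept share 0.35 of the genome-wide ratio); the variable names are illustrative, and the small
difference from the reported P_min = 2.45Mb presumably reflects rounding of the coefficients.

```python
# Rounded sex-averaged human values quoted in the text (chromosome scale).
g0 = 48.1      # y-intercept G_0 in cM
k = 0.78       # slope in cM/Mb
g_min = 50.0   # assumed minimum genetic length in cM (one obligate crossover)

# P_min solves G_min = G_0 + k * P_min.
p_min = (g_min - g0) / k

# Share of the genome-wide cM/Mb ratio attributed to the intercept:
# 22 * G_0 / sum(P_i) = 0.35 as reported, and the genome-wide ratio is 0.35 + k.
intercept_share = 0.35
alpha = intercept_share / (intercept_share + k)

print(f"P_min = {p_min:.2f} Mb")
print(f"alpha = {alpha:.0%}")
```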

Different intercept but similar slope in the two-parameter regression models for
rat and mouse chromosomes

Both rat (Rattus norvegicus) and mouse (Mus musculus) genomes are known to have lower
recombination rates than human (Jensen-Seaman et al. 2004), with rat having a higher overall
RR than mouse. The rat genome has a roughly equal physical map length, but contains one
more chromosome (n=20) than the mouse genome (n=19). Furthermore, rat chromosomes
show a greater heterogeneity in their physical length and one may hypothesize that these
karyotype differences contribute to the somewhat higher RR in rat (0.62 cM/Mb vs. 0.57
cM/Mb in mouse). The regression models of the sex-averaged genetic length of rat and mouse
chromosomes over their sequence lengths (Figure 2(A)) are:

    G_{ch,sex-ave,rat}   = 22.49 + 0.43 P
    G_{ch,sex-ave,mouse} = 15.62 + 0.44 P        (3)

These models display a similar slope, and the different overall RR of rat and mouse mainly
manifests as a different intercept value G_0.

Testing G_0 = 0 for the rat genome is significant (p-value = 0.0012), whereas testing G_0 = 0
for the mouse genome (the fitted G_0 value for mouse is 69% of that for rat) is not significant (p-value
= 0.11). AIC/BIC calculation confirms that the two-parameter regression is a convincingly
better model for rat than the one-parameter regression, whereas this barely holds for the
mouse data (Appendix Table A1). Thus, the mouse genome displays a non-significant excess
of recombination on smaller chromosomes, which is consistent with the smaller variation of
chromosome size in the mouse genome. The greater y-intercept for the rat genome supports
the hypothesis that cytogenetic factors contribute more to the genetic map length of rat than
mouse.

Because the rat karyotype consists of both meta- and acrocentric chromosomes, we repeated
the analysis after splitting all metacentric rat chromosomes into two parts, based on the location
of the centromere. Different from the human genome, for the rat genome this mainly affects
the smaller chromosomes, which are often metacentric. The regression line is now described
by G = 7.48 + 0.52P and testing the intercept is still significant (p-value = 0.024), though at
a less stringent level. Thus, at the scale of chromosome arms, the likelihood of crossovers in
an interval is determined more by its physical length and influenced less by any obligate
recombination requirements.



Recombination rate of small and large chromosomes in the chicken genome

The chicken (Gallus gallus) genome consists of both large (macro-) and small (micro-)
chromosomes (Hillier et al. 2004; Smith et al. 2000), with length ranging from a few Mb to
close to 200Mb. The two-parameter regression model for the chicken genetic data in Figure
2(B) leads to:

    G_{ch,sex-ave,chicken} = 34.68 + 2.7[...] P        (4)

In this regression model, the non-zero intercept is significant with a p-value of 1.33 × 10^-[...]
and there is a considerable difference of AIC/BIC between the one- and two-parameter regression,
favoring the two-parameter model (Appendix Table A1). Both coefficients of determination R^2
for the 2- and 1-parameter regressions attain a high value: 0.98 and 0.93 respectively. A
reason that the 1-parameter regression only marginally reduces the R^2 value is given by the
fact that larger chromosomes contribute much more to the total variance, which is equally well
captured by the 1-parameter model. Thus, two of the three methods confirm a relative excess
of recombination on short chromosomes.

However, the orders of magnitude difference between the sizes of chicken chromosomes raises
the question of the robustness of the regression. From (Hillier et al. 2004) and Appendix Figure
A4, it is clear that the genetic length reaches a plateau at the level of 50cM for
microchromosomes smaller than 8Mb. When chromosomes below a certain length threshold are discarded
from the regression analysis, the y-intercept value changes slightly, but not dramatically. For
example, if the length thresholds for removal are 8Mb and 25Mb, G_0 for Eq.(4) decreases to
32.95 and 31.88. When the regression model is only fitted to the five largest chromosomes
(longer than 50Mb), the model parameters are G = 26.22 + 2.84P. On the other hand, if we
remove the largest five chromosomes, the regression line is G = 31.86 + 3.01P.

To see how the quality of the map distance measurements may influence these results, we
next looked at the recently updated chicken map (Groenen et al. 2009), which contains more
genetic markers and higher marker density. Applying the two-parameter regression leads to

    G_{ch,sex-ave,chicken} = 34.23 + 2.04 P        (5)

As can be seen, the overall reduction of RR as compared to the older map (Hillier et al.
2004) mainly manifests as a reduced estimate of k, whereas the estimate of G_0 remains almost
unchanged.


Exceptionally high recombination on the largest honey bee chromosome leads to
a better fit of the one-parameter than the two-parameter model

Notably, the two-parameter regression does not provide a better fit for the genetic map data
from honey bee (Apis mellifera) (Beye et al. 2006) than the one-parameter model. When plotting
the genetic length over physical length (Figure 2(C)), the y-intercept of the regression line
does not significantly differ from zero (p-value = 0.81):

    G_{ch,sex-ave,bee} = -4.22 + 23.49 P        (6)

The coefficient of determination for both the two- and one-parameter regression is around 0.95.
In contrast to other genomes, AIC/BIC analysis favors the one-parameter regression model
(Appendix Table A1).

As can be seen from Figure 2(C), the longest chromosome (chromosome 1) is four times the
length of the shortest chromosome, and the regression result may depend on the presence of this
"outlier". To check this possibility, we repeated the analysis after chromosome 1 was removed,
which led to the regression equation: G = 28.71 + 20.36P. However, also in this model,
testing G_0 = 0 is not significant (p-value = 0.37), both the two- and one-parameter regressions
exhibit similar coefficients of determination (R^2 = 0.80, 0.79), and the zero-intercept regression
is still the better model according to AIC/BIC analysis (Appendix Table A1). Therefore,
different from other species, the honey bee genome does not display any significant excess of
recombination on smaller chromosomes.



Two-parameter regression at much shorter length scales: the example of budding
yeast

Yeast (S. cerevisiae) has been extensively used to study the molecular machinery of
recombination and it has a much smaller (~12Mb) and more compact genome (Cherry et al. 1997).
Although the physical length of yeast chromosomes only ranges from 200kb to 1.5Mb, their
genetic length is between 100 and 500cM, even longer than the genetic length of human
chromosomes. The best fitting regression line for the yeast genetic map is (Figure 2(D)):

    G_{ch,sex-ave,yeast} = 49.12 + 284.74 P        (7)

The non-zero y-intercept is significant (p-value = 0.009). The two-parameter regression is
superior to the one-parameter model as judged by AIC/BIC (Appendix Table A1). The value
of the y-intercept, 49.12 cM, is very close to 50cM, which corresponds to almost one crossing over
event for a hypothetical chromosome of physical length zero.

The extremely high recombination rate in the yeast genome is surprising. From the molecular
perspective, one can speculate about various hypotheses, such as a different meiotic regulatory
system which makes a denser spatial distribution of chiasmata possible, a lack of secondary
chromatin structure as compared to higher organisms so that the actual physical distance
between two locations on a chromosome is more or less equal to the linear sequence distance, or
the lack of other supporting mechanisms to hold chromatids together so that more chiasmata
per chromosome arm are required for proper chromosome segregation. On the other hand, the
y-intercept of the regression has a similar magnitude as that observed for higher organisms,
indicating a similar relative excess of recombination on smaller chromosomes.

The difference between the central gene cluster and telomeric regions in worm
genome is due to a difference in G_0

Finally, we used genetic map data from the worm C. elegans to show that the two-parameter
regression strategy can also be useful to compare different regions within a genome. The
chromosomes of C. elegans are unusual, because discrete centromeres are missing and the
chromosomes are holocentric, i.e. microtubules attach at many sites for chromatid segregation
(Tyler-Smith and Floridia 2000). Accordingly, the Marey map analysis of the worm genome
indicates that each worm chromosome can be partitioned into three regions: the central gene-rich
region with a low recombination rate and two distal telomeric regions with high recombination
rates (Barnes et al. 1995). Therefore, we separately performed the regression analysis of genetic
length over physical length for these two types of regions (Figure 3). The fitted regression
coefficients are:

    G_{central,worm} = 2.22 + 1.01 P
    G_{distal,worm}  = 18.39 + 0.94 P        (8)

Within the single parameter framework without the intercept term G_0, the two types of
regions would have a very different cM/Mb ratio: 4.57 for telomeric regions, 0.68 for central
regions. However, when allowing a non-zero G_0 value, the two regions display similar slope
values, 1.01 and 0.94. This indicates a constant excess of recombination in the distal region as
compared to the central region in C. elegans, which is combined with a similar incremental
cM/Mb ratio. Thus, after accounting for a fixed amount of recombination in a distal
chromosome region, the likelihood of any additional recombination depends in similar strength on
physical length in distal regions and central regions.

As a note of caution, one may point out that the regression coefficients in Eq.(8) are obtained
from only a few data points. Nevertheless, further regression diagnostics support our
conclusion. For example, testing G_0 = 0 is significant for the distal regions (p-value = 0.006), but not
significant for central regions (p-value = 0.57). AIC/BIC analyses lead to the same conclusion
(Appendix Table A1).

Discussion and Conclusions

Our results show that instead of the simpler genetic-to-physical length ratio, the relationship
between the physical and genetic map length at chromosome scale is better described by a
statistical model that contains a second parameter G_0, which is the y-intercept of the regression
of genetic map length over the physical chromosome length. A conceptually similar approach
was used earlier in measuring the genome-wide recombination rate of a species by counting
the chiasmata on each chromosome in excess of one (Burt and Bell 1987).
The consideration of this intercept parameter is important, because karyotype structure has
been established as an important determinant of genome-wide RR (De Villena and Sapienza
2001a; Coop 2005) and smaller chromosomes display higher RR (Lander et al. 2001; Hillier et al.
2004). Our proposed two-parameter model provides a formal expression of this size dependency
of RR: RR = G/P = k + G_0/P, i.e., a constant term k plus a second term that increases
for smaller chromosome sizes P (if G_0 is positive). This is what we observe for human, mouse,
rat, chicken and yeast genomes. When writing G_0 as G_0 = G - kP, the y-intercept measures
the amount of recombination after the physical map length has been accounted for. Therefore,
one would expect that the total map length G of a chromosome increases by G_0 after splitting
it up into two separate parts. In fact, this has already been quantitatively observed for the
experimental alteration of yeast chromosome I (Kaback et al. 1992).
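
Both consequences of the model, the size dependency RR = k + G_0/P and the gain of G_0 after splitting a
chromosome, can be checked numerically. The sketch below plugs in the sex-averaged human chromosome fit
(G = 48.1 + 0.78P) purely as an example; the function names and the chosen chromosome lengths are
illustrative assumptions.

```python
G0 = 48.1   # y-intercept in cM (sex-averaged human chromosome fit)
K = 0.78    # slope in cM/Mb


def genetic_length(p_mb):
    """Predicted genetic length G = G_0 + k * P for a chromosome of p_mb megabases."""
    return G0 + K * p_mb


def recombination_rate(p_mb):
    """Predicted chromosome-wide rate RR = G / P = k + G_0 / P in cM/Mb."""
    return genetic_length(p_mb) / p_mb


def length_after_split(p_mb, fraction):
    """Total predicted genetic length after splitting one chromosome into two parts."""
    return genetic_length(p_mb * fraction) + genetic_length(p_mb * (1.0 - fraction))


for p in (50.0, 100.0, 250.0):
    print(f"P = {p:5.0f} Mb -> RR = {recombination_rate(p):.2f} cM/Mb")

gain = length_after_split(100.0, 0.3) - genetic_length(100.0)
print(f"gain in map length after splitting: {gain:.1f} cM")
```

Under the model the predicted rate falls from about 1.74 cM/Mb at 50Mb to about 0.97 cM/Mb at 250Mb,
and the gain after a split equals G_0 regardless of where the chromosome is divided.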

When comparing RR between species, the usage of k instead of the genome-wide cM/Mb
ratio will reduce the influence of karyotype differences on the result. This was also the intention
behind the counting of chiasmata per chromosome in excess of one (Burt and Bell 1987). In
our study, the order of species remains unchanged, whether ranked by k or by cM/Mb ratio.
However, due to the different values of k, we cannot use a single regression line to model the
genetic-physical length relationship across species. Thus, a molecular mechanism must exist
that drives, within a particular species, the proportional increase of genetic over physical map
length. This mechanism might typically act with weaker strength in larger genomes, which
could contribute to the inverse correlation between genome size and RR (Lynch 2006).

Among mammals it was furthermore found that RR is more similar for more closely related
species (Dumont and Payseur 2008), which could be partly due to their similar karyotype.
It might be interesting to test where in the phylogenetic tree the signal might be altered,
when using k instead of the genome-wide cM/Mb ratio. In this context, it is also important
that genome-wide RR typically differs between genders and individuals (Broman et al. 1998;
Kong et al. 2004; Cheung et al. 2007; Petkov et al. 2007). The biological factors that
were invoked as possible explanations, such as differences in synaptonemal complex formation
or crossover interference, may be more plastic than karyotype structure. The respective
strength of these factors could also contribute to species differences and may be better
measured by using k than by using the genome-wide cM/Mb ratio.

If k were equal to zero with the obligate chiasma requirement holding true, then G_0
would be required to be 50cM. This pattern can be observed for female opossum (Monodelphis
domestica), where each chromosome acquires exactly one crossover near one of its telomeres
(see (Mikkelsen et al. 2007) and Appendix Figure A5). Similarly, very small chromosomes
may always acquire exactly one crossover, despite reduced chromosome size, as seen for the
microchromosomes in the chicken genome. In order to predict the transition from this plateau
to the linear regression, we derived the minimum physical length parameter P_min from a
given G_min and the estimated regression parameters. Note that if both physical and genetic
length are measured as those in excess of P_min and G_min, their ratio is exactly equal to k:

    (G - G_min) / (P - P_min) = (G - G_min) / (P - (G_min - G_0)/k)
                              = (G - G_min) / ((kP + G_0 - G_min)/k) = k.



Because reduced recombination may result in aneuploidy of smaller chromosomes (Warren et al.
1987; Brown et al. 2000), it is conceivable that the length of smaller chromosomes could
influence genome-wide RR by introducing a lower bound for the propensity for chiasma formation
in a species. Our analysis supports the size of the smaller chromosomes as a strong
determinant of genome-wide RR for the six genomes studied in this paper (Appendix Figure A6).
In log-log scale, the correlation coefficient between RR and the shortest chromosome length
is -0.92 (p-value = 0.008). If the recombination rate is measured by k, in log-log scale the
correlation coefficient is -0.91 (p-value = 0.01). This correlation is nearly as strong as the
correlation between RR and the total physical length for over 100 genomes (cc = -0.99,
p-value = 0.0003 on log-log scale) reported in (Lynch 2006). Obviously, data on more species
are needed for a more conclusive analysis. Nevertheless, it may be interesting to point out that
the genome with the lowest known recombination rate, opossum, lacks any short chromosome
(Mikkelsen et al. 2007; Samollow et al. 2007).




Obviously, any genome-wide analysis relies on the availability of high-quality data. We are
convinced that the data used in this study are of sufficient quality to study recombination on
the chromosomal scale. However, some error might be introduced because the genetic maps used
are not perfect and, in particular near telomeres, are missing some data. This can
be seen for the chicken genome, where two chromosomes fall below the minimum genetic
length of 50 cM in the older map and climb to about 50 cM in the newer map (Appendix Figure
A4). Data selectively missing crossovers at the telomeres might lead to an underestimation of
$G_0$ in the regression model.

We note that we restricted our analysis to chromosomes or chromosome arms. If the genetic
length is regressed over the length of much smaller regions, the coefficient of determination
is expected to be much lower due to a mixture of recombination hot- and cold-spots. From
a biological perspective, we also would not expect a positive $G_0$ value in such a regression,
because there is no requirement for a Mb-sized region to have at least one chiasma to maintain
meiotic integrity.

In summary, we find that the introduction of the $G_0$ parameter helps to understand the
recombination rate differences between species, because it separates the effect of the
requirement for at least one chiasma on smaller chromosomes from the factors that
determine the amount of recombination on larger chromosomes. More specifically, the partitioning
of chromosome-scale recombination rate leads to the following conclusions: i) human
male-female RR differences disproportionately affect larger chromosomes; ii) the higher
recombination rate in the rat genome as compared to the mouse genome is likely to be caused by
the higher number of smaller chromosomes that constitute the rat karyotype; iii) both chicken
micro- and macro-chromosomes display a high RR, and the extraordinarily high RR of some
micro-chromosomes does not lead to an extraordinary excess of recombination on smaller
chromosomes; iv) the honeybee genome does not display any significant excess of recombination on
smaller chromosomes; v) yeast displays a relative excess of recombination on smaller
chromosomes that is similar to higher organisms, despite its outstandingly high overall
recombination rate; vi) recombination in the worm genome mainly occurs in telomeric regions and,
given one recombination per chromosome arm, the likelihood of a second recombination is determined
by physical map length. These examples demonstrate that the proposed statistical framework
allows one to pinpoint differences in genomic recombination rate, which should be useful for
the further study of genome-wide recombination rate as a quantitative trait of fundamental
importance.



Materials and methods 



Genetic map data 



The human genetic map was obtained from (Kong et al. 2002), which uses 5136 microsatellite
markers with 1257 meiotic events and is estimated from pedigree data (Supplementary Table
E: http://www.nature.com/ng/journal/v31/n3/suppinfo/ng917-S1.html). The rat (Rattus
norvegicus) and mouse (Mus musculus) genetic map data were obtained from Table 1 of
(Jensen-Seaman et al. 2004), based on 2305 markers in rat and 4880 markers in mouse. The two
chicken (Gallus gallus) genetic maps were obtained from Supplementary Table S2 of (Hillier et al.
2004), which is built from 1471 markers, and from Table 1 of (Groenen et al. 2009), built from
9258 markers. The honey bee (Apis mellifera) genetic map was obtained from Table 2 of (Beye et al.
2006), based on 1500 markers. The budding yeast (Saccharomyces cerevisiae) genetic map was
downloaded from http://downloads.yeastgenome.org/chromosomal_feature/SGD_features.tab. The
worm (Caenorhabditis elegans) physical and genetic lengths of central "gene clusters" and
distal "arms" were obtained from Table 1 of (Barnes et al. 1995), based on 168 markers.



Measuring how good a linear regression is by coefficient of determination 

Regression analyses were carried out with the lm() function of the R statistical package. The
genetic lengths $\{G_i\}$ ($i = 1, 2, \ldots, n$; e.g., n = 22 for the chromosome-scale regression
and n = 34 for the chromosome-arm-scale regression) can be regressed over the sequence lengths
$\{P_i\}$ ($i = 1, 2, \ldots, n$) allowing a y-intercept (non-zero G when P approaches 0):

$$G = G_0 + kP, \tag{9}$$

or without the y-intercept (G approaches 0 as P approaches 0):

$$G = kP. \tag{10}$$
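The two regressions of Eqs. (9) and (10) have closed-form least-squares solutions. The sketch below is an illustration in plain Python rather than the R lm() call used for the paper; the function names and the toy data (generated exactly from G = 40 + 1.0 P) are hypothetical.

```python
def fit_with_intercept(P, G):
    """Ordinary least squares for G = g0 + k*P; returns (g0, k)."""
    n = len(P)
    mean_p = sum(P) / n
    mean_g = sum(G) / n
    sxx = sum((p - mean_p) ** 2 for p in P)
    sxy = sum((p - mean_p) * (g - mean_g) for p, g in zip(P, G))
    k = sxy / sxx
    g0 = mean_g - k * mean_p
    return g0, k


def fit_through_origin(P, G):
    """Least squares for G = k*P (no intercept); returns k."""
    return sum(p * g for p, g in zip(P, G)) / sum(p * p for p in P)


# Hypothetical toy data: physical lengths (Mb) and genetic lengths (cM)
# generated exactly from G = 40 + 1.0 * P
P = [50.0, 100.0, 150.0, 200.0, 250.0]
G = [40.0 + 1.0 * p for p in P]

g0, k = fit_with_intercept(P, G)
k_origin = fit_through_origin(P, G)

print(f"two-parameter fit: G0 = {g0:.2f} cM, k = {k:.3f} cM/Mb")
print(f"one-parameter fit: k = {k_origin:.3f} cM/Mb")
```

When the data have a positive intercept, forcing the line through the origin inflates the slope, so the one-parameter k overstates the recombination rate of the larger chromosomes.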



How well a linear regression model fits the data can be measured by the coefficient of
determination $R^2$, which is the proportion of variability that is explained by the model. More
specifically, if $SS_{tot} = \sum_{i=1}^{n} (G_i - \bar{G})^2$ is the total sum of squares of the
genetic lengths of chromosomes, and the term $SS_{err} = \sum_{i=1}^{n} (G_i - G_0 - kP_i)^2$ for
allowing a non-zero y-intercept, or the term $SS_{err} = \sum_{i=1}^{n} (G_i - kP_i)^2$ for not
allowing a y-intercept, is the residual sum of squares (RSS), then

$$R^2 = 1 - \frac{SS_{err}}{SS_{tot}} = 1 - \frac{RSS}{SS_{tot}}. \tag{11}$$
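As an illustration of the coefficient of determination, the sketch below computes it for a two-parameter and a one-parameter fit of the same data. The data values and names are hypothetical, and the total sum of squares is assumed to be centered on the mean genetic length for both models.

```python
def r_squared(G, fitted):
    """1 - SS_err / SS_tot, with SS_tot centered on the mean of G."""
    mean_g = sum(G) / len(G)
    ss_tot = sum((g - mean_g) ** 2 for g in G)
    ss_err = sum((g - f) ** 2 for g, f in zip(G, fitted))
    return 1.0 - ss_err / ss_tot


# Hypothetical physical lengths (Mb) and genetic lengths (cM)
P = [50.0, 100.0, 150.0, 200.0, 250.0]
G = [95.0, 135.0, 195.0, 240.0, 285.0]

# Two-parameter least squares: G = g0 + k2 * P
n = len(P)
mean_p, mean_g = sum(P) / n, sum(G) / n
k2 = (sum((p - mean_p) * (g - mean_g) for p, g in zip(P, G))
      / sum((p - mean_p) ** 2 for p in P))
g0 = mean_g - k2 * mean_p

# One-parameter least squares through the origin: G = k1 * P
k1 = sum(p * g for p, g in zip(P, G)) / sum(p * p for p in P)

r2_two = r_squared(G, [g0 + k2 * p for p in P])
r2_one = r_squared(G, [k1 * p for p in P])

print(f"R^2 with intercept:    {r2_two:.4f}")
print(f"R^2 without intercept: {r2_one:.4f}")
```

One caveat when comparing with R output: for a model without an intercept, R's summary() reports an R-squared based on the uncentered total sum of squares, so its value is not directly comparable with the centered definition used in this sketch.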



Model selection by Akaike information criterion 



The Akaike information criterion (AIC) (Akaike 1974) of a statistical model is defined as
2p - 2 log(L), where p is the number of parameters in the model and L is the maximum likelihood
estimated from the data. Similarly, the Bayesian information criterion (BIC) (Schwarz 1978) is
defined as log(n)p - 2 log(L), where n is the number of samples used to calculate the likelihood.
For linear regression, AIC/BIC is related to the residual sum of squares (RSS) according to
(Venables and Ripley 1999) by:

$$\begin{aligned} AIC &= 2p + n \log(RSS/n) \\ BIC &= \log(n)\,p + n \log(RSS/n) \end{aligned} \tag{12}$$

where n is the number of sample points for the regression analysis. Between two statistical
models that are fitted to the same dataset, the model with a smaller AIC/BIC value is
considered to be better than the model with a larger AIC/BIC value.

For the comparison between the two- and one-parameter regressions, we have:

$$\begin{aligned} AIC_2 - AIC_1 &= 2 - n \log\frac{RSS_1}{RSS_2} \\ BIC_2 - BIC_1 &= \log(n) - n \log\frac{RSS_1}{RSS_2} \end{aligned} \tag{13}$$

If the second term, $n \log(RSS_1/RSS_2)$, is larger than two (for AIC, or log(n) for BIC), then the
two-parameter regression can be seen as a better model than the single-parameter regression.
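The model comparison can be written as a short calculation. This is a minimal sketch under stated assumptions: the function names are illustrative, the RSS values are hypothetical, and n = 22 is chosen to mirror the chromosome-scale human regression.

```python
import math


def delta_aic(n, rss1, rss2):
    """AIC_2 - AIC_1 for a two-parameter (rss2) vs. one-parameter (rss1) regression."""
    return 2.0 - n * math.log(rss1 / rss2)


def delta_bic(n, rss1, rss2):
    """BIC_2 - BIC_1 for the same pair of regressions."""
    return math.log(n) - n * math.log(rss1 / rss2)


n = 22                       # number of chromosomes in the regression
rss1, rss2 = 2000.0, 1000.0  # hypothetical residual sums of squares

d_aic = delta_aic(n, rss1, rss2)
d_bic = delta_bic(n, rss1, rss2)

# Negative values favor the two-parameter model
print(f"delta AIC = {d_aic:.2f}, delta BIC = {d_bic:.2f}")
```

Because both differences share the term n log(RSS_1/RSS_2), the gap between them is always log(n) - 2. For n = 22 this is about 1.09, which agrees with the spacing of the two columns for the human chromosome rows in Table A1 (for example -45.3 and -44.2).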

Quantities derived from $G_0$

The linear relationship between G and P cannot extend to the physical length of zero if the
y-intercept is greater than zero and the obligate chiasma requirement holds. Therefore, a point
$P_{min}$ must exist below which genetic map length remains constant at $G_{min}$, independent of
the actual physical map length of a chromosome. We can define this transition point as follows:
$P_{min}$ is the physical length at which the regression line crosses the horizontal line defined by
the minimum genetic length $G_{min}$, thus $P_{min} = (G_{min} - G_0)/k$.

Another derived quantity is the genome-wide percentage of genetic length that is explained
by $G_0$. For a single chromosome (i), this percentage is $\alpha_i = G_0/(G_0 + kP_i)$. For the whole
genome, it is $\alpha = nG_0/(nG_0 + k\sum_i P_i)$, where n is the number of chromosomes. This definition
of $\alpha$ is valid only when the y-intercept is positive ($G_0 > 0$).
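The share of genetic length attributed to the intercept can be sketched as follows. The regression parameters and chromosome lengths are hypothetical, and the function names are illustrative.

```python
def alpha_chromosome(g0, k, p):
    """Fraction of one chromosome's predicted genetic length explained by g0."""
    if g0 <= 0:
        raise ValueError("the definition requires a positive intercept")
    return g0 / (g0 + k * p)


def alpha_genome(g0, k, lengths):
    """Genome-wide fraction of predicted genetic length explained by g0."""
    if g0 <= 0:
        raise ValueError("the definition requires a positive intercept")
    n = len(lengths)
    return n * g0 / (n * g0 + k * sum(lengths))


# Hypothetical regression parameters and chromosome lengths (Mb)
G0, K = 40.0, 1.0
P = [50.0, 100.0, 150.0, 200.0, 250.0]

a_small = alpha_chromosome(G0, K, P[0])   # smallest chromosome
a_large = alpha_chromosome(G0, K, P[-1])  # largest chromosome
a_all = alpha_genome(G0, K, P)

print(f"smallest: {a_small:.3f}, largest: {a_large:.3f}, genome: {a_all:.3f}")
```

The per-chromosome fraction falls as physical length grows, so the genome-wide value lies between the values of the smallest and the largest chromosome.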
Appendix

1 Marey map for human chromosomes (Fig. A1)

2 Checking the normality assumption of the regression (Fig. A2)

3 Testing the robustness of regression result by adding random
noise to the genetic map (Fig. A3)

4 AIC and BIC of two-parameter regression models vs. one-parameter
models (Table A1)



source of genetic length data          ΔAIC = AIC2 - AIC1    ΔBIC = BIC2 - BIC1
human chromosome, sex-averaged         -42.5                 -41.4
    female                             -45.3                 -44.2
    male                               [illegible]           -33.4
human ch. arm, sex-averaged            [illegible]           [illegible]
    female                             [illegible]           [illegible]
    male                               -43.6                 -42.1
rat, chromosome, sex-averaged          -9.9                  -8.9
mouse, chromosome, sex-averaged        -1.1                  -0.2
chicken, chromosome, sex-averaged      -28.7                 -27.5
honeybee, chromosome, sex-averaged     1.9                   2.7
    remove chromosome 1                1.0                   1.7
yeast, chromosome, sex-averaged        -6.1                  -5.3
worm, central region                   1.4                   0.96
    distal region                      -7.9                  -7.6

Table A1: Difference of Akaike and Bayesian information criterion (AIC and BIC) between the two- and
one-parameter regression models for modeling chromosome-scale recombination in 7 genomes. A negative ΔAIC
or ΔBIC value favors the two-parameter regression model.
5 Chicken genetic length vs. physical length in log-log scale (Fig. A4)

6 Two-parameter regression model of Opossum chromosome
genetic length (Fig. A5)

7 Recombination rates of six genomes as a function of the smallest
chromosome size (Fig. A6)
Figure 2: The genetic length (in cM) vs. physical length (in Mb) plotted for five genomes: (A) mouse
(crosses) (Mus musculus) and rat (circles) (Rattus norvegicus). Source: Table 1 of (Jensen-Seaman et al.
2004). The regression lines are: y = 15.62 + 0.44x (mouse), y = 22.49 + 0.43x (rat). (B) chicken (Gallus
gallus). Source: old data (year 2004, circle) is from supplementary Table A2 of (Hillier et al. 2004); new
data (year 2008, cross) is from Table 1 of (Groenen et al. 2009). The regression lines are: y = 34.68 + 2.79x
(old data) and y = 34.23 + 2.04x (new data). (C) honeybee (Apis mellifera). Source: Table 2 of
(Beye et al. 2006). The regression line is y = -4.22 + 23.49x. (D) budding yeast (Saccharomyces
cerevisiae). Source: http://downloads.yeastgenome.org/chromosomal_feature/SGD_features.tab. The regression
line is y = 49.12 + 287.74x.
Figure 3: The cM-Mb plot using the physical and genetic length of central gene clusters (5 data points) and
distal arms (10 data points) of five worm (Caenorhabditis elegans) chromosomes (Table 1 of (Barnes et al.
1995)). The best fitting regression lines are y = 18.39 + 0.94x for the distal/telomeric arms (crosses), and
y = -2.22 + 1.01x for the central (centromeric) gene cluster regions (circles).
Figure A1: Marey map for human genetic data (Supplementary Table E of (Kong et al. 2002)), with y-axis
showing the genetic distance (in cM) from the first marker to the current marker, and x-axis showing the
physical distance (in Mb). (A) Each line traces a chromosome (chromosome name is shown as label). (B)
Each line traces an arm of a metacentric chromosome (p- and q-arms are shown as label). The straight line
indicates cM = Mb.
Figure A2: QQ-plots of the residuals of regression ($\epsilon = G - G_0 - kP$, y-axis) against simulated normal
variables (x-axis). The variance of the normal variables is chosen to be equal to that of the regression
residuals. Due to the small number of sample points (22 for human chromosomes, 34 for human chromosome arms),
four sets of normal random variables were generated to test the robustness. In each QQ-plot, "o" denotes the
regression for the sex-averaged map of human chromosomes, "f" for the human female map data, "m" for human
male map data, and "a" for sex-averaged human map data for chromosome arms (see Eqs. (1, 2)).
Figure A3: Histogram of the y-intercept $G_0$ for the human sex-averaged chromosome regression
$G_{ch,sex-ave,human} = G_0 + kP$ when noise is added to the genetic length G. The noise was modeled as a
normally distributed variable with zero mean and standard deviation (sd) of 8.51517, which is the observed
sd of regression residuals $\epsilon = G - G_0 - kP$ for $G_{ch,sex-ave,human} \sim P$ in Eq. (1).
Figure A4: The genetic length (in cM) vs. physical length (in Mb) plots for chicken genome (Gallus gallus)
in log-log scale (for linear-linear scale, see Fig. 2(B)).
Figure A5: Two-parameter regression of Opossum (Monodelphis domestica) chromosome genetic length (cM)
over physical length (Mb). The regression coefficients for female map (red) are: $G_{female} = 54.206 + 0.003P$, for
male map (blue): $G_{male} = 19.610 + 0.216P$, and for sex-averaged map (black): $G_{sex-ave} = 31.275 + 0.114P$.






Figure A6: Recombination rates of six genomes (yeast, chicken, honeybee, human, rat and mouse) as a function
of the smallest chromosome size. Genome-wide recombination rate is measured both by the genome-wide
genetic-to-physical length ratio (RR = G/P) (circles) and by the regression slope k (pluses). The plot is in
log-log scale, with the smallest chromosome length (Mb) on the x-axis.



